feat: add milvus vector db integration #419

AbhishekRP2002 · 2024-12-19T21:40:15Z

initial draft PR for basic milvus vector db integration, aiming [Feature Request]: Add support for Milvus Vector Database #415
handled minor typos that crossed my eyes in backend/modules/vector_db/mongo.py
@chiragjn Can u please help me in understanding the test setup that you guys follow for testing the functionalities / unit testing ?
This PR includes a basic implementation of Milvus vector DB integration ( via Milvus Lite predominantly with Quick setup ref: https://milvus.io/docs/create-collection-instantly.md#Quick-Setup )
Post successful merge of the basic implementation we can scope out the integration of additional extended configs for Milvus like it's done for Qdrant and provide more flexibility with Customized setup, I am thinking.

mnvsk97

Thanks for taking time to add a new vector db integration and making corrections to the mongo file. Please have a look at the requested changes.

backend/modules/vector_db/milvus.py

mnvsk97 · 2024-12-22T11:36:15Z

@AbhishekRP2002 Please add an example config in the compose.env and comment so anyone who wants to use Milvus can refer to it.

mnvsk97 · 2024-12-29T03:55:01Z

@AbhishekRP2002 PR looks good to me. Regarding your question about testing

Can u please help me in understanding the test setup that you guys follow for testing the functionalities / unit testing ?

We do not have a unit testing setup at the moment. Can you please post a few screenshots/videos showing that the basic operations work? If all looks good, we can go ahead and merge the PR.

AbhishekRP2002 · 2024-12-29T11:52:51Z

We do not have a unit testing setup at the moment. Can you please post a few screenshots/videos showing that the basic operations work? If all looks good, we can go ahead and merge the PR.

yes yes, I wrote a simple test.py script and encountered some bugs, I am fixing them and re-checking, will push the final changes in sometime.

AbhishekRP2002 · 2024-12-29T12:14:22Z

`model_config.yaml` file I used ollama for the embedding model

Here is the test.py script I used to test the basic functionalities:

from langchain_openai import OpenAIEmbeddings  # noqa
from langchain_ollama import OllamaEmbeddings
from langchain.docstore.document import Document
from backend.types import VectorDBConfig
from backend.modules.vector_db.milvus import MilvusVectorDB
from backend.logger import logger
from rich.pretty import pprint


logger.setLevel("DEBUG")


def test_milvus_functionality():
    # Initialize configuration for local Milvus
    config = VectorDBConfig(
        provider="milvus",
        local=True,
        url="",
        api_key="",
        config={"metric_type": "COSINE"},
    )

    # Initialize Milvus client
    milvus_client = MilvusVectorDB(config)

    # Initialize embeddings (you'll need your OpenAI API key)
    embeddings = OllamaEmbeddings(model="nomic-embed-text")

    input_text = "The meaning of life is 42"
    vector = embeddings.embed_query(input_text)
    print(vector[:3])

    # Test collection name
    collection_name = "test_collection"

    # Create test documents
    test_documents = [
        Document(
            page_content="This is a test document 1",
            metadata={"_data_point_fqn": "test/doc1", "_data_point_hash": "hash1"},
        ),
        Document(
            page_content="This is a test document 2",
            metadata={"_data_point_fqn": "test/doc2", "_data_point_hash": "hash2"},
        ),
    ]

    try:
        print("Starting Milvus functionality test...")

        # Test 1: Create collection
        print("\n1. Testing collection creation...")
        milvus_client.create_collection(collection_name, embeddings)
        print("✓ Collection created successfully")

        # Test 2: List collections
        print("\n2. Testing list collections...")
        collections = milvus_client.get_collections()
        pprint(milvus_client.milvus_client.describe_collection(collection_name))
        pprint(milvus_client.milvus_client.get_collection_stats(collection_name))
        assert collection_name in collections
        print(f"✓ Collections listed successfully: {collections}")

        # Test 3: Upsert documents
        print("\n3. Testing document upsert...")
        milvus_client.upsert_documents(collection_name, test_documents, embeddings)
        print("✓ Documents upserted successfully")

        pprint(
            milvus_client.milvus_client.query(
                collection_name=collection_name,
                filter="",
                limit=2,
                output_fields=["id", "metadata", "text"],
            )
        )

        # Test 4: List data point vectors
        print("\n4. Testing list data point vectors...")
        vectors = milvus_client.list_data_point_vectors(collection_name, "test/doc1")
        assert len(vectors) > 0
        print(f"✓ Retrieved {len(vectors)} vectors successfully")

        # Test 5: Delete specific vectors
        print("\n5. Testing vector deletion...")
        milvus_client.delete_data_point_vectors(collection_name, vectors)
        print("✓ Vectors deleted successfully")

        # Test 6: Clean up - Delete collection
        print("\n6. Testing collection deletion...")
        milvus_client.delete_collection(collection_name)
        collections = milvus_client.get_collections()
        assert collection_name not in collections
        print("✓ Collection deleted successfully")

        print("\nAll tests completed successfully!")

    except Exception as e:
        print(f"\n Test failed: {str(e)}")
        try:
            if collection_name in milvus_client.get_collections():
                milvus_client.delete_collection(collection_name)
        except Exception as e:
            print(f"Failed to clean up after failure: {str(e)}")
            pass
        raise e


if __name__ == "__main__":
    test_milvus_functionality()

Here is the output screenshot:

PS: since auto_id=True so even if there is no primary id or id field in the Document object, milvus will internally create a primary_id and assign respectively.

mnvsk97 · 2024-12-30T04:35:38Z

Thanks for the detailed explanation @AbhishekRP2002. Please let me know once the PR is ready with the updated test.py and we can go ahead and merge. Also, please resolve the review comments.

AbhishekRP2002 · 2024-12-30T05:07:09Z

Thanks for the detailed explanation @AbhishekRP2002. Please let me know once the PR is ready with the updated test.py and we can go ahead and merge. Also, please resolve the review comments.

Resolved the review comments. Updated test.py file as in ? I think I cannot add it in the codebase right ? test.py is under git ignore.

mnvsk97 · 2025-01-01T04:32:03Z

LGTM 🚀

Thanks for the detailed explanation @AbhishekRP2002. Please let me know once the PR is ready with the updated test.py and we can go ahead and merge. Also, please resolve the review comments.

Resolved the review comments. Updated test.py file as in ? I think I cannot add it in the codebase right ? test.py is under git ignore.

I think its best if we resolve testing as a separate setup later on. Looks good for now.

mnvsk97 · 2025-01-01T04:40:49Z

@AbhishekRP2002 Please resolve the precommit errors. To fix this issue, you can run something like pre-commit run --all-files.

AbhishekRP2002 · 2025-01-01T06:07:19Z

@AbhishekRP2002 Please resolve the precommit errors. To fix this issue, you can run something like pre-commit run --all-files.

Sure. Will do it.

AbhishekRP2002 · 2025-01-01T14:37:55Z

hi @mnvsk97 i have fixed the pre-commit errors.
and Happy New year to you, the team and your fam! have a great one

mnvsk97 · 2025-01-03T05:06:25Z

Merged. Thanks for the contribution. Happy new year !

jinhonglin-ryan

Hi, this is Jinhong from Milvus, based on the existing milvus integration for cognita, I have added a few more functionalities! Feel free to let me know what you think!

jinhonglin-ryan · 2025-01-04T03:59:05Z

backend/modules/vector_db/milvus.py

+            metric_type=self.metric_type,  # https://milvus.io/docs/metric.md#Metric-Types : check for other supported metrics
+            schema=schema,
+            auto_id=True,
+        )


Hi, this is Jinhong from Milvus. We have transitioned to the new MilvusClient interface and are actively working to phase out the use of the older ORM interface. In this context, your current approach to adding fields to a collection may be updated to the following:

schema = self.milvus_client.create_schema( auto_id=False, enable_dynamic_field=True ) schema.add_field( field_name="id", datatype=DataType.INT64, is_primary=True, ) schema.add_field( field_name="vector", datatype=DataType.FLOAT_VECTOR, dim=vector_size ) schema.add_field( field_name="text", datatype=DataType.VARCHAR, max_length=65535, ) schema.add_field( field_name="metadata", datatype=DataType.JSON, ) index_params = self.milvus_client.prepare_index_params() index_params.add_index( field_name="vector", index_type="FLAT", metric_type=self.metric_type, ) self.milvus_client.create_collection( collection_name=collection_name, schema=schema, index_params=index_params, )

jinhonglin-ryan · 2025-01-04T04:12:35Z

backend/modules/vector_db/milvus.py

+                collection_name=collection_name, filter=delete_expr
+            )
+
+        logger.debug(f"[Milvus] Deleted {len(data_point_vectors)} data point vectors")


Consider adding the following 3 functions: list_documents_in_collection, delete_documents, and list_document_vector_points Here is my implementation based on your code:

def list_documents_in_collection( self, collection_name: str, base_document_id: str = None ) -> List[str]: """ List all documents in a collection. Args: collection_name (str): The name of the collection. base_document_id (str, optional): Base document ID for filtering. Defaults to None. Returns: List[str]: List of document IDs. """ logger.debug( f"[Milvus] Listing all documents with base document ID {base_document_id} for collection {collection_name}" ) stop = False offset = 0 document_ids_set = set() while not stop: filter_expr = ( f'metadata["{DATA_POINT_FQN_METADATA_KEY}"] == "{base_document_id}"' if base_document_id else "" ) search_result = self.milvus_client.query( collection_name=collection_name, filter=filter_expr, output_fields=["metadata"], limit=BATCH_SIZE, offset=offset, ) if not search_result: stop = True break for doc in search_result: metadata = doc.get("metadata", {}) if metadata.get(DATA_POINT_FQN_METADATA_KEY): document_ids_set.add(metadata.get(DATA_POINT_FQN_METADATA_KEY)) if len(document_ids_set) > MAX_SCROLL_LIMIT: stop = True break if len(search_result) < BATCH_SIZE: stop = True else: offset += BATCH_SIZE logger.debug( f"[Milvus] Found {len(document_ids_set)} documents with base document ID {base_document_id} in collection {collection_name}" ) return list(document_ids_set)

def delete_documents(self, collection_name: str, document_ids: List[str]): """ Delete documents from a collection based on document IDs. Args: collection_name (str): The name of the collection. document_ids (List[str]): List of document IDs to delete. """ logger.debug( f"[Milvus] Deleting {len(document_ids)} documents from collection {collection_name}" ) if not document_ids: logger.warning("[Milvus] No document IDs provided for deletion.") return try: for i in range(0, len(document_ids), BATCH_SIZE): document_ids_to_delete = document_ids[i : i + BATCH_SIZE] delete_expr = " or ".join( [ f'metadata["{DATA_POINT_FQN_METADATA_KEY}"] == "{doc_id}"' for doc_id in document_ids_to_delete ] ) self.milvus_client.delete( collection_name=collection_name, filter=delete_expr ) logger.debug( f"[Milvus] Deleted {len(document_ids)} documents from collection {collection_name}" ) except Exception as exp: logger.error(f"[Milvus] Error deleting documents: {exp}")

def list_document_vector_points( self, collection_name: str ) -> List[DataPointVector]: """ List all document vector points in a collection. Args: collection_name (str): The name of the collection. Returns: List[DataPointVector]: List of vector points with metadata. """ logger.debug( f"[Milvus] Listing all document vector points for collection {collection_name}" ) stop = False offset = 0 document_vector_points: List[DataPointVector] = [] while not stop: search_result = self.milvus_client.query( collection_name=collection_name, output_fields=["id", "metadata"], limit=BATCH_SIZE, offset=offset, ) if not search_result: stop = True break for doc in search_result: metadata = doc.get("metadata", {}) if metadata.get(DATA_POINT_FQN_METADATA_KEY) and metadata.get( DATA_POINT_HASH_METADATA_KEY ): document_vector_points.append( DataPointVector( data_point_vector_id=str(doc["id"]), data_point_fqn=metadata.get(DATA_POINT_FQN_METADATA_KEY), data_point_hash=metadata.get(DATA_POINT_HASH_METADATA_KEY), ) ) if len(document_vector_points) > MAX_SCROLL_LIMIT: stop = True break if len(search_result) < BATCH_SIZE: stop = True else: offset += BATCH_SIZE logger.debug( f"[Milvus] Listed {len(document_vector_points)} document vector points for collection {collection_name}" ) return document_vector_points

feat: add milvus vector db integration

388fd05

mnvsk97 requested changes Dec 20, 2024

View reviewed changes

chore: requested changes addressed

80b1512

add example config in compose.env

3625df7

fix: bug fixes encountered during testing

9f7b060

Merge branch 'main' into main

957053a

mnvsk97 enabled auto-merge January 1, 2025 04:31

AbhishekRP2002 added 2 commits January 1, 2025 20:01

chore: fix pre-commit checks

f94b4ff

Merge branch 'main' of github.com-personal:AbhishekRP2002/cognita

1258f3a

auto-merge was automatically disabled January 1, 2025 14:32
Head branch was pushed to by a user without write access

mnvsk97 enabled auto-merge January 3, 2025 05:03

mnvsk97 self-assigned this Jan 3, 2025

mnvsk97 approved these changes Jan 3, 2025

View reviewed changes

mnvsk97 merged commit 431d682 into truefoundry:main Jan 3, 2025
1 check passed

jinhonglin-ryan reviewed Jan 4, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add milvus vector db integration #419

feat: add milvus vector db integration #419

AbhishekRP2002 commented Dec 19, 2024 •

edited

Loading

mnvsk97 left a comment

mnvsk97 commented Dec 22, 2024

mnvsk97 commented Dec 29, 2024 •

edited

Loading

AbhishekRP2002 commented Dec 29, 2024 •

edited

Loading

AbhishekRP2002 commented Dec 29, 2024 •

edited

Loading

mnvsk97 commented Dec 30, 2024

AbhishekRP2002 commented Dec 30, 2024

mnvsk97 commented Jan 1, 2025

mnvsk97 commented Jan 1, 2025

AbhishekRP2002 commented Jan 1, 2025

AbhishekRP2002 commented Jan 1, 2025

mnvsk97 commented Jan 3, 2025

jinhonglin-ryan left a comment

jinhonglin-ryan Jan 4, 2025

jinhonglin-ryan Jan 4, 2025

feat: add milvus vector db integration #419

feat: add milvus vector db integration #419

Conversation

AbhishekRP2002 commented Dec 19, 2024 • edited Loading

mnvsk97 left a comment

Choose a reason for hiding this comment

mnvsk97 commented Dec 22, 2024

mnvsk97 commented Dec 29, 2024 • edited Loading

AbhishekRP2002 commented Dec 29, 2024 • edited Loading

AbhishekRP2002 commented Dec 29, 2024 • edited Loading

mnvsk97 commented Dec 30, 2024

AbhishekRP2002 commented Dec 30, 2024

mnvsk97 commented Jan 1, 2025

mnvsk97 commented Jan 1, 2025

AbhishekRP2002 commented Jan 1, 2025

AbhishekRP2002 commented Jan 1, 2025

mnvsk97 commented Jan 3, 2025

jinhonglin-ryan left a comment

Choose a reason for hiding this comment

jinhonglin-ryan Jan 4, 2025

Choose a reason for hiding this comment

jinhonglin-ryan Jan 4, 2025

Choose a reason for hiding this comment

AbhishekRP2002 commented Dec 19, 2024 •

edited

Loading

mnvsk97 commented Dec 29, 2024 •

edited

Loading

AbhishekRP2002 commented Dec 29, 2024 •

edited

Loading

AbhishekRP2002 commented Dec 29, 2024 •

edited

Loading